Decision Tree | Tree Ensemble | XGBoost

Decision Tree Basics

  • Classify by splitting on features that incrementally cluster similar items together, eventually driving to a final classification
  • Select the feature that most quickly narrows the data down to a decisive result

*(Figure: decision-tree v1)*

Decision 1 : Explore Features

  • A good feature is one that drives the resulting nodes toward a pure class (all examples in one class) more quickly

Decision 2: When to stop splitting

  • Stop when further splitting does not improve purity beyond a minimum threshold

Concept of Entropy

  • Measurement of impurity of a node
  • p1 = 0 and p1 = 1 are the most pure; p1 = 0.5 is the most impure
  • Equation : `H(p1) = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)`, where p1 is the fraction of positive examples at the node and p0 = 1 - p1

*(Figure: entropy)*
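A minimal sketch of the entropy formula above, assuming binary labels where `p1` is the fraction of positive examples at a node (the helper name `entropy` is illustrative):

```python
import numpy as np

def entropy(p1):
    """Entropy H(p1) of a binary node, where p1 is the fraction of positive examples."""
    # Convention: a perfectly pure node (p1 = 0 or 1) has zero entropy.
    if p1 == 0 or p1 == 1:
        return 0.0
    p0 = 1 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

print(entropy(0.5))  # 1.0 -> most impure
print(entropy(1.0))  # 0.0 -> pure
```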

How to Choose Which Split to Make

  • This is driven by a concept called Information Gain, which is calculated using Entropy
  • At a decision node, compute the weighted average of the entropy of the left and right branches. Information Gain is the entropy at the node minus this weighted average: `Gain = H(p1_node) - (w_left * H(p1_left) + w_right * H(p1_right))`. The split with the largest Information Gain (the biggest reduction in entropy) is the preferred split

Illustration below:

*(Figure: decision-tree Information Gain)*
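A minimal sketch of the Information Gain calculation, reusing the illustrative `entropy` helper from the sketch above; the split values here are made-up examples:

```python
def information_gain(y_root, y_left, y_right):
    """Entropy at the node minus the weighted average entropy of the two branches."""
    def p1(y):
        # Fraction of positive (label 1) examples in a branch
        return sum(y) / len(y) if len(y) else 0.0

    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)
    return entropy(p1(y_root)) - (w_left * entropy(p1(y_left)) + w_right * entropy(p1(y_right)))

# Made-up example: a 50/50 node split into a mostly-positive and a mostly-negative branch
y_root = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_left = [1, 1, 1, 1, 0]
y_right = [1, 0, 0, 0, 0]
print(information_gain(y_root, y_left, y_right))  # ~0.278
```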

Summary of Decision Tree Flow

  1. Start with all examples at Root Node
  2. Calculate Information Gain on all possible features and choose the one with highest gain
  3. Split data according to the selected feature
  4. Keep splitting recursively (see the sketch after this list) until any of these stopping criteria is met -
  • A node contains examples of only one class (100% pure)
  • Information Gain is less than a threshold value
  • Splitting would exceed the maximum tree depth decided beforehand
  • The number of examples in a node is below a threshold
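A minimal sketch of this flow using scikit-learn's DecisionTreeClassifier (an assumed library and dataset choice, not specified on this page); the stopping criteria above map roughly to the hyperparameters max_depth, min_samples_leaf, and min_impurity_decrease:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative tabular dataset for binary classification
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    criterion="entropy",         # split on Information Gain (entropy reduction)
    max_depth=4,                 # stop when splitting would exceed the maximum depth
    min_samples_leaf=5,          # stop when a node has fewer examples than this
    min_impurity_decrease=0.01,  # stop when the gain is below this threshold
    random_state=42,
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```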

Tree Ensemble

  • Multiple independent decision trees vote, and the final decision is based on the majority
  • This is achieved by training each tree on a sample drawn with replacement from the training data (see the sketch below)
  • A typical number of sampling rounds (trees) is about 100; a larger ensemble usually does not provide significantly higher accuracy
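A minimal sketch of such an ensemble using scikit-learn's RandomForestClassifier (an assumed library and dataset choice); n_estimators corresponds to the number of sampling-with-replacement rounds mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,  # ~100 trees, each trained on its own bootstrap sample
    bootstrap=True,    # draw each tree's training sample with replacement
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```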

XGBoost

  • Full form: eXtreme Gradient Boosting
  • Algorithm that works with a Tree Ensemble by putting more focus on the examples misclassified in previous rounds
  • It reduces error sequentially - in contrast to Random Forest, which builds independent decision trees, XGBoost fits each new tree to correct the errors of the trees built before it
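A minimal sketch using the xgboost package's scikit-learn-style XGBClassifier (assuming xgboost is installed; the dataset choice is illustrative); n_estimators is the number of boosting rounds, with each new tree fit to correct the errors of the previous ones:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=100,   # number of boosting rounds (trees fit one after the other)
    learning_rate=0.1,  # how strongly each new tree corrects previous errors
    max_depth=4,
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```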

Decision Tree Vs. Neural Network

| Decision Tree | Neural Network |
| --- | --- |
| Works well on structured data | Works well on structured and unstructured data |
| Recommended for tabular data | Recommended for speech, text, and video type data |
| Faster processing | Slower processing |
| Human interpretable | Not easy for humans to interpret |
| Can't leverage transfer learning | Transfer learning can help improve accuracy |
| Mostly works as one model for one system | Multiple networks can be strung together in a system to build with multiple models |

Example Code